This study is based on a dataset of more than 120,000 records. Each record contains information about a passenger together with his or her ratings of various attributes of a flight.
The dataset is recommended by mavenanalytics.io and originates from Kaggle. Because the structure is not clearly documented, I, the author of this project, assume that each record represents a unique passenger on a unique flight.
The objectives of this project are to understand the drivers of passenger satisfaction for a US airline and to build a model that predicts customer satisfaction.
As introduced above, the dataset was posted on Kaggle by John D.: https://www.kaggle.com/datasets/johndddddd/customer-satisfaction
Regarding the content, the dataset includes:
The attributes below are rated on a scale from 1 (lowest) to 5 (highest), where 0 means "not applicable".
Lastly, overall satisfaction is a binary variable; it will later serve as the dependent variable in the analysis.
# Confirm the working directory (the data path below is relative to it)
import os
cwd = os.getcwd()
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn import metrics
import plotly.graph_objects as go
from sklearn.metrics import classification_report
from sklearn.inspection import permutation_importance
df = pd.read_csv("data/airline_passenger_satisfaction.csv")
df.head()
| | ID | Gender | Age | Customer Type | Type of Travel | Class | Flight Distance | Departure Delay | Arrival Delay | Departure and Arrival Time Convenience | ... | On-board Service | Seat Comfort | Leg Room Service | Cleanliness | Food and Drink | In-flight Service | In-flight Wifi Service | In-flight Entertainment | Baggage Handling | Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 48 | First-time | Business | Business | 821 | 2 | 5.0 | 3 | ... | 3 | 5 | 2 | 5 | 5 | 5 | 3 | 5 | 5 | Neutral or Dissatisfied |
| 1 | 2 | Female | 35 | Returning | Business | Business | 821 | 26 | 39.0 | 2 | ... | 5 | 4 | 5 | 5 | 3 | 5 | 2 | 5 | 5 | Satisfied |
| 2 | 3 | Male | 41 | Returning | Business | Business | 853 | 0 | 0.0 | 4 | ... | 3 | 5 | 3 | 5 | 5 | 3 | 4 | 3 | 3 | Satisfied |
| 3 | 4 | Male | 50 | Returning | Business | Business | 1905 | 0 | 0.0 | 2 | ... | 5 | 5 | 5 | 4 | 4 | 5 | 2 | 5 | 5 | Satisfied |
| 4 | 5 | Female | 49 | Returning | Business | Business | 3470 | 0 | 1.0 | 3 | ... | 3 | 4 | 4 | 5 | 4 | 3 | 3 | 3 | 3 | Satisfied |
5 rows × 24 columns
# Check data shape
df.shape
(129880, 24)
# Check names of columns
df.columns
Index(['ID', 'Gender', 'Age', 'Customer Type', 'Type of Travel', 'Class',
'Flight Distance', 'Departure Delay', 'Arrival Delay',
'Departure and Arrival Time Convenience', 'Ease of Online Booking',
'Check-in Service', 'Online Boarding', 'Gate Location',
'On-board Service', 'Seat Comfort', 'Leg Room Service', 'Cleanliness',
'Food and Drink', 'In-flight Service', 'In-flight Wifi Service',
'In-flight Entertainment', 'Baggage Handling', 'Satisfaction'],
dtype='object')
# Adjust column names: replace spaces with underscores and drop hyphens to avoid problems with spaces in variable names
df.columns = df.columns.str.replace(' ','_')
df.columns = df.columns.str.replace('-','')
# Check data type of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                                  Non-Null Count   Dtype
---  ------                                  --------------   -----
 0   ID                                      129880 non-null  int64
 1   Gender                                  129880 non-null  object
 2   Age                                     129880 non-null  int64
 3   Customer_Type                           129880 non-null  object
 4   Type_of_Travel                          129880 non-null  object
 5   Class                                   129880 non-null  object
 6   Flight_Distance                         129880 non-null  int64
 7   Departure_Delay                         129880 non-null  int64
 8   Arrival_Delay                           129487 non-null  float64
 9   Departure_and_Arrival_Time_Convenience  129880 non-null  int64
 10  Ease_of_Online_Booking                  129880 non-null  int64
 11  Checkin_Service                         129880 non-null  int64
 12  Online_Boarding                         129880 non-null  int64
 13  Gate_Location                           129880 non-null  int64
 14  Onboard_Service                         129880 non-null  int64
 15  Seat_Comfort                            129880 non-null  int64
 16  Leg_Room_Service                        129880 non-null  int64
 17  Cleanliness                             129880 non-null  int64
 18  Food_and_Drink                          129880 non-null  int64
 19  Inflight_Service                        129880 non-null  int64
 20  Inflight_Wifi_Service                   129880 non-null  int64
 21  Inflight_Entertainment                  129880 non-null  int64
 22  Baggage_Handling                        129880 non-null  int64
 23  Satisfaction                            129880 non-null  object
dtypes: float64(1), int64(18), object(5)
memory usage: 23.8+ MB
# Check basic statistics of numerical variables
df.describe()
| | ID | Age | Flight_Distance | Departure_Delay | Arrival_Delay | Departure_and_Arrival_Time_Convenience | Ease_of_Online_Booking | Checkin_Service | Online_Boarding | Gate_Location | Onboard_Service | Seat_Comfort | Leg_Room_Service | Cleanliness | Food_and_Drink | Inflight_Service | Inflight_Wifi_Service | Inflight_Entertainment | Baggage_Handling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129487.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 | 129880.000000 |
| mean | 64940.500000 | 39.427957 | 1190.316392 | 14.713713 | 15.091129 | 3.057599 | 2.756876 | 3.306267 | 3.252633 | 2.976925 | 3.383023 | 3.441361 | 3.350878 | 3.286326 | 3.204774 | 3.642193 | 2.728696 | 3.358077 | 3.632114 |
| std | 37493.270818 | 15.119360 | 997.452477 | 38.071126 | 38.465650 | 1.526741 | 1.401740 | 1.266185 | 1.350719 | 1.278520 | 1.287099 | 1.319289 | 1.316252 | 1.313682 | 1.329933 | 1.176669 | 1.329340 | 1.334049 | 1.180025 |
| min | 1.000000 | 7.000000 | 31.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 32470.750000 | 27.000000 | 414.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 | 3.000000 |
| 50% | 64940.500000 | 40.000000 | 844.000000 | 0.000000 | 0.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 3.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 |
| 75% | 97410.250000 | 51.000000 | 1744.000000 | 12.000000 | 13.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 |
| max | 129880.000000 | 85.000000 | 4983.000000 | 1592.000000 | 1584.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
# Check NA in dataset
df.isnull().sum()
ID                                          0
Gender                                      0
Age                                         0
Customer_Type                               0
Type_of_Travel                              0
Class                                       0
Flight_Distance                             0
Departure_Delay                             0
Arrival_Delay                             393
Departure_and_Arrival_Time_Convenience      0
Ease_of_Online_Booking                      0
Checkin_Service                             0
Online_Boarding                             0
Gate_Location                               0
Onboard_Service                             0
Seat_Comfort                                0
Leg_Room_Service                            0
Cleanliness                                 0
Food_and_Drink                              0
Inflight_Service                            0
Inflight_Wifi_Service                       0
Inflight_Entertainment                      0
Baggage_Handling                            0
Satisfaction                                0
dtype: int64
From the checks above, the only issue is the 393 missing values in Arrival Delay; these will be replaced with 0, on the assumption that a missing arrival delay means the flight was not delayed.
# Replace NAs at Arrival Delay with 0
df['Arrival_Delay'] = df['Arrival_Delay'].fillna(0)
As mentioned above, the 14 attributes from 'Departure_and_Arrival_Time_Convenience' to 'Baggage_Handling' are rated on a 5-point scale and need adjusting: a 0 ("not applicable") should be replaced with NA.
# Replace 0 with NA in the 14 attributes rated on the 5-point scale
attributes = ['Departure_and_Arrival_Time_Convenience',
'Ease_of_Online_Booking','Checkin_Service',
'Online_Boarding','Gate_Location',
'Onboard_Service','Seat_Comfort',
'Leg_Room_Service',
'Cleanliness',
'Food_and_Drink',
'Inflight_Service',
'Inflight_Wifi_Service',
'Inflight_Entertainment',
'Baggage_Handling']
df[attributes] = df[attributes].replace(0, np.nan)
df.isnull().sum()
ID                                           0
Gender                                       0
Age                                          0
Customer_Type                                0
Type_of_Travel                               0
Class                                        0
Flight_Distance                              0
Departure_Delay                              0
Arrival_Delay                                0
Departure_and_Arrival_Time_Convenience    6681
Ease_of_Online_Booking                    5682
Checkin_Service                              1
Online_Boarding                           3080
Gate_Location                                1
Onboard_Service                              5
Seat_Comfort                                 1
Leg_Room_Service                           598
Cleanliness                                 14
Food_and_Drink                             132
Inflight_Service                             5
Inflight_Wifi_Service                     3916
Inflight_Entertainment                      18
Baggage_Handling                             0
Satisfaction                                 0
dtype: int64
There are now missing values in 13 of these variables; they will be imputed before the model is fitted.
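As a minimal illustration of the mean-imputation strategy used later (toy data, not the real rating columns):

```python
import numpy as np
import pandas as pd

# Toy rating column with two missing values
toy = pd.DataFrame({'rating': [4.0, np.nan, 2.0, np.nan, 3.0]})

# Replace each NA with the mean of the observed values: (4 + 2 + 3) / 3 = 3.0
toy['rating'] = toy['rating'].fillna(toy['rating'].mean())
print(toy['rating'].tolist())  # [4.0, 3.0, 2.0, 3.0, 3.0]
```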
# Set the 'ID' column as the index
df.set_index('ID', inplace=True)
# Remove duplicates, if any
df = df.drop_duplicates()
# Splitting the DataFrame into train and test sets (80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=199)
# Separating features (X) and target (Y) for both train and test sets
X_train = train_df.drop('Satisfaction', axis=1)
y_train = train_df['Satisfaction']
X_test = test_df.drop('Satisfaction', axis=1)
y_test = test_df['Satisfaction']
# Function to calculate percentage and count labels for pie charts
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))
        return '{p:.2f}% ({v:d})'.format(p=pct, v=val)
    return my_autopct
# Function to draw a standardized pie chart
def make_pie_chart(ax, df, var, title, var_order=None):
    if ax is None:  # In case the chart stands alone, not in a grid of charts
        ax = plt.gca()
    ax.set_title(title, size=20, y=1.5, x=0.5)
    if var_order:
        counts = df[var].value_counts().reindex(var_order)
    else:
        counts = df[var].value_counts()
    ax.pie(counts, labels=counts.index, shadow=True,
           autopct='%1.1f%%', pctdistance=0.5, radius=2, startangle=90)
# Pie chart comparing 3 classes
make_pie_chart(ax=None, df=train_df, var='Class', title='Class', var_order=None)
Most records belong to the Business and Economy classes, whose shares are nearly equal; Economy Plus has the smallest share, at 7.2%.
# Correlation matrix heatmap (correlations are scale-invariant, so standardizing is not needed first)
numerical_columns = df.select_dtypes(include=np.number).columns
plt.figure(figsize=(12, 6))
sns.heatmap(df[numerical_columns].corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
All of the correlations are below 0.7, so there are no highly correlated pairs (|r| in [0.7, 0.9] or above) in the data.
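The "no pair above 0.7" claim can also be checked programmatically; a small sketch on synthetic data (toy columns 'a' through 'd', where 'd' is deliberately built to be nearly a copy of 'a') showing how to list correlated pairs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
toy['d'] = toy['a'] * 0.95 + rng.normal(scale=0.1, size=200)  # nearly a copy of 'a'

corr = toy.corr().abs()
# Keep the upper triangle only, so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.7]
print(high_pairs)  # only the ('a', 'd') pair should exceed the threshold
```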
Next, the three classes are compared, so the training set is split by class.
biz = train_df.query('`Class` == "Business"')
eco = train_df.query('`Class` == "Economy"')
eco_plus = train_df.query('`Class` == "Economy Plus"')
# Create a frame to put 3 subplots
fig, axs = plt.subplots(1, 3, figsize=(16.5, 5))
# Create consistent order for all charts
customer_type_order = ['Returning', 'First-time']
# Create one chart for each class
make_pie_chart(axs[0], biz, 'Customer_Type', 'Business', customer_type_order)
make_pie_chart(axs[1], eco, 'Customer_Type', 'Economy', customer_type_order)
make_pie_chart(axs[2], eco_plus, 'Customer_Type','Economy Plus', customer_type_order)
# Adjust layout to prevent clipping of labels
plt.tight_layout()
# Show the plots
plt.show()
Economy Plus is the smallest class in terms of share, and it also has the lowest proportion of first-time customers. First-time flyers are most likely to choose Economy.
# Create a frame to put 3 subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
# Create consistent order for all charts
travel_type_order = ['Business', 'Personal']
# Create one chart for each class
make_pie_chart(axs[0], biz, 'Type_of_Travel', 'Business', travel_type_order)
make_pie_chart(axs[1], eco, 'Type_of_Travel', 'Economy', travel_type_order)
make_pie_chart(axs[2], eco_plus, 'Type_of_Travel','Economy Plus', travel_type_order)
# Adjust layout to prevent clipping of labels
plt.tight_layout()
# Show the plots
plt.show()
The more expensive the class, the more likely it is chosen for business travel. Business class, as its name suggests, is used almost exclusively for business trips.
# Create a frame to put 3 subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # 1 row, 3 columns
# Create consistent order for all charts
gender_order = ['Male', 'Female']
# Create one chart for each class
make_pie_chart(axs[0], biz, 'Gender', 'Business', gender_order )
make_pie_chart(axs[1], eco, 'Gender', 'Economy', gender_order )
make_pie_chart(axs[2], eco_plus, 'Gender','Economy Plus', gender_order )
# Adjust layout to prevent clipping of labels
plt.tight_layout()
# Show the plots
plt.show()
There is no notable difference in the gender distribution across classes.
# Function to draw a standardized box plot
def make_box_plot(ax, df, var, var_range, title):
    ax.set_title(title, size=20, fontweight='bold', y=1.05)
    sns.boxplot(x=var, data=df, ax=ax)
    ax.set_xticks(var_range)
    ax.set_xlabel(var, size=15)
# Create graphs for Age
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # 1 row, 3 columns
age_range = np.arange(0, 110, 10)
make_box_plot(axs[0], biz, 'Age', age_range, 'Business')
make_box_plot(axs[1], eco, 'Age', age_range, 'Economy')
make_box_plot(axs[2], eco_plus, 'Age', age_range, 'Economy Plus')
# Adjust layout to prevent clipping of labels
plt.tight_layout()
Passengers in Business class are slightly older than those in the other two classes. Between Economy and Economy Plus, the age distributions are almost identical.
# Create graphs for Flight_Distance
fig, axs = plt.subplots(1, 3, figsize=(15, 5)) # 1 row, 3 columns
distance_range = np.arange(0, 5001, 500)
make_box_plot(axs[0], biz, 'Flight_Distance', distance_range, 'Business')
make_box_plot(axs[1], eco, 'Flight_Distance', distance_range, 'Economy')
make_box_plot(axs[2], eco_plus, 'Flight_Distance', distance_range, 'Economy Plus')
# Adjust layout to prevent clipping of labels
plt.tight_layout()
Passengers in Business class fly significantly longer distances than those in Economy or Economy Plus: their median distance is above 1500 miles, while most journeys in Economy and Economy Plus are under 1000 miles.
In this section, I focus on the evaluations of the 14 service metrics to see whether they differ among the three classes.
# Set the labels and calculate the mean values for the radar chart
attribute_labels = ['Departure and Arrival Time Convenience',
'Ease of Online Booking','Check-in Service',
'Online Boarding','Gate Location',
'Onboard Service','Seat Comfort',
'Leg Room Service',
'Cleanliness',
'Food and Drink',
'In-flight Service',
'In-flight Wifi Service',
'In-flight Entertainment',
'Baggage Handling']
satisfaction = df[attributes+['Class']]
mean_values = satisfaction.groupby('Class').mean()
# Create the radar chart
fig = go.Figure()
fig.add_trace(go.Scatterpolar(
r=list(mean_values.loc['Business',]),
theta=attribute_labels,
fill='toself',
name='Business'
))
fig.add_trace(go.Scatterpolar(
r=list(mean_values.loc['Economy',]),
theta=attribute_labels,
fill='toself',
name='Economy'
))
fig.add_trace(go.Scatterpolar(
r=list(mean_values.loc['Economy Plus',]),
theta=attribute_labels,
fill='toself',
name='Economy Plus'
))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, 5]
)),
showlegend=True
)
fig.show()
Similar to the profile comparison, Economy and Economy Plus show similar evaluation patterns, fluctuating around 3 points across metrics. Compared with Business class, their satisfaction is lower on most attributes. However, passengers from all classes feel much the same about Gate Location and In-flight Wifi Service. Interestingly, Business-class passengers are more disappointed with the Departure and Arrival Time Convenience.
In this section, I create a logistic regression model to predict whether customers are satisfied with their trips, based on their profile and their evaluation of the trip components.
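Logistic regression models the probability of the positive class as the sigmoid of a linear combination of the predictors; a minimal sketch with hypothetical weights (not the fitted coefficients):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, intercept, and one standardized feature row
w = np.array([1.2, -0.7])
b = 0.3
x = np.array([0.5, 1.0])

p_satisfied = sigmoid(w @ x + b)  # z = 0.6 - 0.7 + 0.3 = 0.2
print(round(p_satisfied, 3))  # ≈ 0.55
```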
First, I need to preprocess the predictors; the steps include:
def transform(df, imputer=None, scaler=None):
    """Impute, standardize, and one-hot encode the predictors.

    The imputer and scaler are fitted only when none are passed in, so
    statistics are learned from the training set and merely applied to
    the test set, avoiding data leakage.
    """
    # Get numerical columns
    numerical_columns = df.select_dtypes(include=np.number).columns
    if imputer is None:
        imputer = SimpleImputer(missing_values=np.nan, strategy='mean').fit(df[numerical_columns])
        scaler = StandardScaler().fit(imputer.transform(df[numerical_columns]))
    # Impute missing values, then standardize numerical features
    df[numerical_columns] = scaler.transform(imputer.transform(df[numerical_columns]))
    # One-hot encode categorical features
    df = pd.get_dummies(df)
    return df, imputer, scaler
# Preprocess the predictors: fit the transformers on the train set, then apply them to the test set
X_train, imputer, scaler = transform(X_train)
X_test, _, _ = transform(X_test, imputer, scaler)
# Align test columns with train columns in case a dummy category is missing from one split
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
It is also necessary to check the balance of the dependent variable.
# Check balance of dependent variable
y_train.value_counts()
Satisfaction
Neutral or Dissatisfied    58733
Satisfied                  45171
Name: count, dtype: int64
Currently, the classes are mildly imbalanced, so an oversampling method (SMOTE) will be applied.
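SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbors; the core step, sketched in numpy with toy points (not the real features):

```python
import numpy as np

rng = np.random.default_rng(199)

x = np.array([1.0, 2.0])         # a minority-class sample
neighbor = np.array([3.0, 4.0])  # one of its nearest minority neighbors
gap = rng.random()               # random position along the segment, in [0, 1)

# The synthetic sample lies on the line segment between the two points
synthetic = x + gap * (neighbor - x)
print(synthetic)
```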
# Oversampling to balance the dependent variable
smote = SMOTE(random_state=199)
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)
After all the preprocessing is done, I fit the model and predict on the test data.
# Initiate the model
logreg = LogisticRegression(random_state=199, max_iter = 1000)
# Fit the model with data
logreg.fit(X_train_oversampled, y_train_oversampled)
# Predict the test
y_pred = logreg.predict(X_test)
# Confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names = logreg.classes_  # label order used by confusion_matrix: ['Neutral or Dissatisfied', 'Satisfied']
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names, rotation=45)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print(classification_report(y_test, y_pred))
precision recall f1-score support
Neutral or Dissatisfied 0.90 0.88 0.89 14719
Satisfied 0.85 0.88 0.86 11257
accuracy 0.88 25976
macro avg 0.88 0.88 0.88 25976
weighted avg 0.88 0.88 0.88 25976
Looking at the confusion matrix:
Misclassifications are fairly balanced between the two classes, although slightly more "Neutral or Dissatisfied" instances are misclassified as "Satisfied" than the other way around.
For the classification report:
The model performs well overall, with high precision, recall, and F1-score for both classes.
The accuracy of 88% indicates that the model is effective at distinguishing "Neutral or Dissatisfied" from "Satisfied" instances.
The class imbalance (more "Neutral or Dissatisfied" than "Satisfied" instances) does not appear to significantly affect the model's performance.
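For reference, the report's metrics follow directly from confusion-matrix counts; a sketch with made-up counts (not the values above):

```python
# Made-up counts for one class: true positives, false positives, false negatives
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)  # 80 / 100 = 0.8
recall = tp / (tp + fn)     # 80 / 90  ≈ 0.889

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```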
# Get the column index of positive class
positive_class_index = list(logreg.classes_).index('Satisfied')
# Select the probabilities associated with the positive class
y_pred_proba = logreg.predict_proba(X_test)[:, positive_class_index]
# Label the positive class
positive_class_label = 'Satisfied'
# Convert y_true to binary labels
y_true_binary = (y_test == positive_class_label).astype(int)
# Calculate ROC curve
fpr, tpr, _ = metrics.roc_curve(y_true_binary, y_pred_proba)
# Calculate AUC
auc = metrics.roc_auc_score(y_true_binary, y_pred_proba)
# Plot ROC curve
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.3f})")
plt.legend(loc=4)
plt.show()
Regarding the ROC, an AUC of 0.948 indicates that the model distinguishes between the classes well, with strong discriminatory power.
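The AUC has a ranking interpretation: it is the probability that a randomly chosen positive receives a higher predicted score than a randomly chosen negative. A toy numpy sketch of that equivalence (made-up scores, not the model's):

```python
import numpy as np

# Made-up predicted probabilities for positive and negative examples
scores_pos = np.array([0.9, 0.8, 0.4])
scores_neg = np.array([0.5, 0.2])

# Fraction of (positive, negative) pairs ranked correctly
wins = sum(p > n for p in scores_pos for n in scores_neg)
auc = wins / (len(scores_pos) * len(scores_neg))
print(auc)  # 5 of 6 pairs ranked correctly
```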
Last but not least, I am going to check the importance of the variables using the permutation method.
# Check the importance of the variables using permutation method
result = permutation_importance(logreg, X_test, y_test, n_repeats=10, random_state=199)
feature_importance = pd.DataFrame({'Feature': X_test.columns,
'Importance': result.importances_mean,
'Standard Deviation': result.importances_std})
feature_importance = feature_importance.sort_values('Importance', ascending=True)
ax = feature_importance.plot(x='Feature', y='Importance', kind='barh', figsize=(10, 6), yerr='Standard Deviation', capsize=4)
ax.set_xlabel('Permutation Importance')
ax.set_title('Permutation Importance with Standard Deviation')
The graph shows that the most important variables describe the customer's profile: class, type of travel, gender, and customer type. This insight helps predict satisfaction based on who is evaluating the journey; in other words, it is useful for segment targeting.
However, what the airline can actually improve are the 14 rated attributes. Among them, Online Boarding, In-flight Wifi Service, and On-board Service are the most important drivers of customer satisfaction.
The least important factors are In-flight Entertainment, Food and Drink, and Ease of Online Booking.
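For intuition, permutation importance measures how much a score drops when one feature's values are shuffled, breaking its relationship with the target. A self-contained toy sketch (a hand-made rule stands in for the fitted model):

```python
import numpy as np

rng = np.random.default_rng(199)

# Toy data: the label depends only on feature 0
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def accuracy(X, y):
    # A fixed stand-in "model": predict 1 when feature 0 is positive
    return np.mean((X[:, 0] > 0).astype(int) == y)

baseline = accuracy(X, y)  # 1.0 by construction
drops = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column
    drops.append(baseline - accuracy(X_perm, y))

print(drops)  # large drop for feature 0, zero drop for feature 1
```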